OAK-11568 Elastic: improved compatibility for aggregation definitions #2193
Conversation
thomasmueller commented Mar 19, 2025 (edited)
- Analyzer configuration is now lenient, quite similar to the Lucene index behavior. This will allow converting Lucene indexes to Elasticsearch. Warnings are logged where needed.
- This PR also removes unused code, and reduces compiler warnings.
- The tests in ElasticIndexHelperTest cover error handling when loading files that are not configured (IllegalStateException etc.).
- The tests in FullTextAnalyzerCommonTest cover compatibility problems.
- With the NGram tokenizer (not the filter), behaviour differs between Elastic and Lucene in one case: if the query contains multiple words, Lucene finds the result but Elastic does not.
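One plausible way to picture the multi-word mismatch: an NGram tokenizer runs over the raw input, spaces included, so it emits grams that span word boundaries, while a query analyzed word by word never produces those grams. The sketch below is standalone illustration code (hypothetical class and method names, not Oak or Elasticsearch code) and only demonstrates the gram generation, not the exact root cause inside either engine.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: generate all n-grams of the raw input string,
// including grams that span the space between words.
public class NGramSketch {

    static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int len = min; len <= max; len++) {
            for (int i = 0; i + len <= text.length(); i++) {
                out.add(text.substring(i, i + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Grams such as "o b" exist only when the tokenizer sees the full
        // string "foo bar"; analyzing "foo" and "bar" separately can never
        // produce them, so index-side and query-side grams may not line up.
        System.out.println(ngrams("foo bar", 2, 3));
    }
}
```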
if ("n_gram".equals(name)) {
    // OAK-11568
    // https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
    Integer minGramSize = getIntegerSetting(args, "minGramSize", 2);
    Integer maxGramSize = getIntegerSetting(args, "maxGramSize", 3);
    return TokenizerDefinition.of(t -> t.ngram(
            NGramTokenizer.of(n -> n.minGram(minGramSize).maxGram(maxGramSize))));
}
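The helper `getIntegerSetting` is referenced but not shown in this hunk. A plausible sketch of what such a lenient helper might look like (hypothetical implementation, not the actual Oak code) is: read the analyzer argument, and fall back to the default when the entry is missing or not a parseable number, matching the lenient behavior this PR aims for.

```java
import java.util.Map;

// Hypothetical sketch of a lenient integer-setting reader: missing or
// malformed values fall back to the default instead of failing.
public class SettingSketch {

    static Integer getIntegerSetting(Map<String, Object> args, String name, int defaultValue) {
        Object value = args.get(name);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            // lenient: fall back rather than fail the index definition
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        System.out.println(getIntegerSetting(Map.of("minGramSize", "4"), "minGramSize", 2));
    }
}
```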
This is okay for now. We should structure it better to cover all the possible tokenizers (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). This can go in a separate PR.
Yes, I agree!
name = "hyphenation_decompounder";
String hyphenator = args.getOrDefault("hyphenator", "").toString();
LOG.info("Using the hyphenation_decompounder: {}", hyphenator);
args.put("hyphenation_patterns_path", "analysis/hyphenation_patterns.xml");
Should "analysis/hyphenation_patterns.xml" be installed on the Elastic nodes?
I wanted to use a fixed name, so it is possible to configure it. Installing this would have to be done manually, and we need to document it.
if (skipEntry) {
    continue;
}
String key = name + "_" + i;
filters.put(key, factory.apply(name, JsonData.of(args)));
if (name.equals("word_delimiter_graph")) {
    wordDelimiterFilterKey = key;
} else if (name.equals("synonym")) {
    if (wordDelimiterFilterKey != null) {
        LOG.info("Removing word delimiter because there is a synonyms filter as well: " + wordDelimiterFilterKey);
        filters.remove(wordDelimiterFilterKey);
    }
}
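The conflict rule above can be exercised in isolation. This is a standalone sketch (hypothetical names, not Oak code) that keeps filters in insertion order and drops the previously added word delimiter entry once a synonym filter appears, mirroring the removal logic in the hunk:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of the filter-conflict rule: a synonym filter does not
// compose with word_delimiter_graph, so the delimiter entry is removed.
public class FilterChainSketch {

    static Map<String, String> buildChain(List<String> filterNames) {
        Map<String, String> filters = new LinkedHashMap<>(); // preserves filter order
        String wordDelimiterFilterKey = null;
        int i = 0;
        for (String name : filterNames) {
            String key = name + "_" + i++;
            filters.put(key, name);
            if (name.equals("word_delimiter_graph")) {
                wordDelimiterFilterKey = key;
            } else if (name.equals("synonym") && wordDelimiterFilterKey != null) {
                filters.remove(wordDelimiterFilterKey);
            }
        }
        return filters;
    }

    public static void main(String[] args) {
        System.out.println(buildChain(List.of("word_delimiter_graph", "synonym")).keySet());
    }
}
```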
Another option could be the use of https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-multiplexer-tokenfilter.html
We can work on this in a separate PR.
Yes, I also thought about that, but I haven't found good documentation on it yet.